In [1]:
%%HTML
<script src="require.js"></script>
In [2]:
from IPython.display import display, HTML
HTML(
    """
    <script
        src='https://cdnjs.cloudflare.com/ajax/libs/jquery/2.0.3/jquery.min.js'>
    </script>
    <script>
        code_show=true;
        function code_toggle() {
        if (code_show){
        $('div.jp-CodeCell > div.jp-Cell-inputWrapper').hide();
        } else {
        $('div.jp-CodeCell > div.jp-Cell-inputWrapper').show();
        }
        code_show = !code_show
        }
        $( document ).ready(code_toggle);
    </script>
    <form action='javascript:code_toggle()'>
        <input type="submit" value='Click here to toggle on/off the raw code.'>
    </form>
    """
)
Out[2]:
In [3]:
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from collections import Counter

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
In [23]:
# Set helper functions
def plot_3d(df_new, y_predict=None):
    """
    Create a 3D scatter plot using Plotly.

    Parameters
    ----------
    df_new : numpy.ndarray
        Input data for the 3D scatter plot; the first three columns
        are indexed positionally, so an array (e.g. PCA output) is
        expected rather than a DataFrame.

    y_predict : array-like, optional
        Array of values used for coloring the markers.
    """
    fig = go.Figure(data=[go.Scatter3d(
        x=df_new[:, 0],
        y=df_new[:, 1],
        z=df_new[:, 2],
        mode='markers',
        marker=dict(size=5,
                    color=y_predict,
                    opacity=0.8))])
    
    fig.update_layout(margin=dict(l=0, r=0, b=0, t=0),
                      scene=dict(
                          xaxis_title='PCA 1',
                          yaxis_title='PCA 2',
                          zaxis_title='PCA 3'))
    fig.show()

def plot_dendrogram(Z):
    """
    Plot a truncated dendrogram using Ward's Method.

    Parameters
    ----------
    Z : array-like
        The linkage matrix representing the hierarchical clustering.

    Returns
    -------
    matplotlib.axes._axes.Axes
        The matplotlib axes containing the dendrogram.
    """
    fig, ax = plt.subplots()
    dn = dendrogram(Z, ax=ax, p=5, truncate_mode='level')
    ax.set_xlabel(r'Datapoints')
    ax.set_ylabel(r'$\Delta$')
    ax.set_title("Figure 4. Dendrogram for Ward's Method")
    return ax

Abstract

The study delves into the critical task of identifying distinct customer segments within a business's consumer base and understanding their key characteristics, purchase behaviors, and preferences. By employing hierarchical clustering with Ward's method on a dataset encompassing diverse aspects such as people, products, promotion, and place, the analysis reveals two main clusters, "Affluent Traditionalists" and "Digital Economizers," with the second further divided into two sub-clusters, "Budget-Focused Families" and "Flexible Consumers," each presenting unique characteristics and behaviors.

Recommendations are tailored for targeted marketing and engagement efforts for each cluster. For the Affluent Traditionalists, the focus should be on premium offerings, personalized services, and enhancing the in-store experience. On the other hand, Digital Economizers require strategies revolving around digital engagement, value-based promotions, and accommodation of the needs of budget-conscious families. The study outlines limitations, such as potential data diversity constraints and the absence of qualitative insights, urging future research to address these gaps. Additionally, advanced methodologies like predictive analytics, machine learning models, and qualitative research methods are recommended to enhance segmentation accuracy and provide deeper insights into customer behaviors and preferences. Integrating external factors and mapping customer journeys can further refine the understanding of market dynamics. Feature engineering and utilizing various clustering methods and preprocessing techniques are also suggested to broaden the scope of analysis.

In conclusion, the study not only identifies customer segments but also provides strategic recommendations, outlines limitations, and suggests advanced methodologies for future research, offering a comprehensive roadmap for businesses aiming to tailor their strategies and stay relevant in a dynamic market landscape.

Problem Statement

One of the first and most important questions every business must ask themselves is who they believe their target market is, followed closely by what value they are able to offer to said target market. By initially identifying who their customers are, the business can work towards attracting, catering to, and retaining them.

However, more often than not, a company's consumer base is not limited to a single customer profile. Most customer bases consist of several groups that share distinct characteristics, purchase behaviors, and product preferences. The business can take advantage of these differences to enhance its marketing, sales, and operations strategies and tailor the customer experience for each valued segment.

With this in mind, the business should ask themselves the following:

What are the different segments of a company’s customers, and what are their key characteristics that differentiate them from each other?


The report aims to answer this problem, as well as give further insight on the next possible steps the business can take upon identification of the customer segments.

Motivation

There are several reasons why identifying the different segments of a business' customer base is important. Firstly, it allows the company to conduct targeted marketing and sales efforts depending on its ideal customer, designing products and campaigns that target customers are most likely to engage with. A focused marketing strategy gives the company the ability to optimize how it spends its budget and resources to reach that specific niche. Additionally, understanding the preferences of several customer segments allows the company more flexibility in terms of resource allocation. Customer experience, and in the long term retention, can also be improved with knowledge of customer profiles, as understanding behaviors and preferences can help the company design programs aimed at keeping customers engaged and content.

The challenge is that the consumer base is constantly evolving and changing. What a business's customers look like now may be very different from a few years ago, or from how they will look in the future. A business that strives to understand its customers and their several segments can cater to this diverse set of needs. Overall, the aim is to provide data that the company can leverage for informed decision-making when it comes to keeping customers satisfied, providing the company with a competitive advantage.

Data Source

The data used for the study is available via the Asian Institute of Management (AIM) Jojie-collected public datasets under the directory /mnt/data/public/customer-personality-analysis/marketing_campaign.csv and was loaded via pandas. The data can also be found on Kaggle (Patel, 2021). As described on the website, the dataset contains four main categories: people, products, promotion and place.

Each category type pertains to the following:

  • People: Identifiers of each customer such as their ID, date of birth and other personal features relating to the customer themselves.
  • Products: Relates to the amount of money spent on different products such as fruit or meat.
  • Promotion: Refers to customer behavior towards marketing campaigns and discounts.
  • Place: Number of purchases a customer makes via different channels such as the store or via the web.

Data Exploration

Displayed in Table 1. is a snapshot of the Customer Personality dataset, loaded via pandas from the csv file. The raw data contains 2240 rows, each representing a customer, and 29 columns representing the different customer features.

Table 1. Snapshot of Customer Personality Analysis Dataset
In [5]:
df = pd.read_csv('/mnt/data/public/customer-personality-analysis/'
                 'marketing_campaign.csv', sep='\t')
df.columns = df.columns.str.lower()

display(df.head())
print(f'Data dimensions: {df.shape}')
id year_birth education marital_status income kidhome teenhome dt_customer recency mntwines ... numwebvisitsmonth acceptedcmp3 acceptedcmp4 acceptedcmp5 acceptedcmp1 acceptedcmp2 complain z_costcontact z_revenue response
0 5524 1957 Graduation Single 58138.0 0 0 04-09-2012 58 635 ... 7 0 0 0 0 0 0 3 11 1
1 2174 1954 Graduation Single 46344.0 1 1 08-03-2014 38 11 ... 5 0 0 0 0 0 0 3 11 0
2 4141 1965 Graduation Together 71613.0 0 0 21-08-2013 26 426 ... 4 0 0 0 0 0 0 3 11 0
3 6182 1984 Graduation Together 26646.0 1 0 10-02-2014 26 11 ... 6 0 0 0 0 0 0 3 11 0
4 5324 1981 PhD Married 58293.0 1 0 19-01-2014 94 173 ... 5 0 0 0 0 0 0 3 11 0

5 rows × 29 columns

Data dimensions: (2240, 29)

Data Overview

The name, data type, description and category of each feature are detailed in Table 2., using the descriptions and categories provided on Kaggle as reference (Patel, 2021).

Table 2. Description of Customer Personality Analysis Features
Feature Name Type Description Category
id integer Customer's unique identifier people
year_birth integer Customer's birth year people
education object Customer's education level people
marital_status object Customer's marital status people
income float Customer's yearly household income people
kidhome integer Number of children in customer's household people
teenhome integer Number of teenagers in customer's household people
dt_customer object Date of customer's enrollment with the company people
recency integer Number of days since customer's last purchase people
mntwines integer Amount spent on wine in last 2 years products
mntfruits integer Amount spent on fruits in last 2 years products
mntmeatproducts integer Amount spent on meat in last 2 years products
mntfishproducts integer Amount spent on fish in last 2 years products
mntsweetproducts integer Amount spent on sweets in last 2 years products
mntgoldprods integer Amount spent on gold in last 2 years products
numdealspurchases integer Number of purchases made with a discount promotion
numwebpurchases integer Number of purchases made through the company’s website place
numcatalogpurchases integer Number of purchases made using a catalogue place
numstorepurchases integer Number of purchases made directly in stores place
numwebvisitsmonth integer Number of visits to company’s website in the last month place
acceptedcmp3 integer 1 if customer accepted the offer in the 3rd campaign, 0 otherwise promotion
acceptedcmp4 integer 1 if customer accepted the offer in the 4th campaign, 0 otherwise promotion
acceptedcmp5 integer 1 if customer accepted the offer in the 5th campaign, 0 otherwise promotion
acceptedcmp1 integer 1 if customer accepted the offer in the 1st campaign, 0 otherwise promotion
acceptedcmp2 integer 1 if customer accepted the offer in the 2nd campaign, 0 otherwise promotion
complain integer 1 if the customer complained in the last 2 years, 0 otherwise people
z_costcontact integer Unknown variable others
z_revenue integer Unknown variable others
response integer 1 if customer accepted the offer in the last campaign, 0 otherwise people

Assumptions:

  • Amounts are assumed to be for US dollars (USD).
  • For products-category features, amount spent on products is assumed to be the monthly average for the past 2 years.
  • For place-category features, number of purchases is assumed to be on a monthly average basis.

For object-type categorical features such as education and marital_status, the values are listed in Table 3.

Table 3. Values under Categorical Features
education marital_status
Basic Single
2n Cycle Together
Graduation Married
Master Divorced
PhD Widow
Alone
Absurd
YOLO
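As a quick check, the distinct values of these categorical columns can be listed with pandas. The snippet below is a minimal sketch using a small hypothetical frame in place of the full dataset:

```python
import pandas as pd

# Hypothetical stand-in for the dataset's two categorical columns
df_toy = pd.DataFrame({
    "education": ["Basic", "PhD", "Master", "Basic"],
    "marital_status": ["Single", "YOLO", "Married", "Single"],
})
for col in ["education", "marital_status"]:
    print(col, sorted(df_toy[col].unique()))
```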

The dt_customer feature is currently set as an object data-type, displayed in Table 4.

Table 4. Snapshot of Dt_customer Feature
In [6]:
display(df[['dt_customer']].head())
dt_customer
0 04-09-2012
1 08-03-2014
2 21-08-2013
3 10-02-2014
4 19-01-2014

The statistics of relevant integer and float-type features can be viewed per category type to understand their average values and expected variances.

Table 5. Statistics for Features under People Category
In [7]:
col_people = ['year_birth',
              'income',
              'kidhome',
              'teenhome',
              'recency',
              'complain']
col_products = ['mntwines',
                'mntfruits',
                'mntmeatproducts',
                'mntfishproducts',
                'mntsweetproducts',
                'mntgoldprods']
col_promotion = ['numdealspurchases',
                 'acceptedcmp1',
                 'acceptedcmp2',
                 'acceptedcmp3',
                 'acceptedcmp4',
                 'acceptedcmp5',
                 'response']
col_place = ['numwebpurchases',
             'numcatalogpurchases',
             'numstorepurchases',
             'numwebvisitsmonth']
col_others = ['z_costcontact',
              'z_revenue']

df.describe().loc[['mean', 'std'], col_people]
Out[7]:
year_birth income kidhome teenhome recency complain
mean 1968.805804 52247.251354 0.444196 0.506250 49.109375 0.009375
std 11.984069 25173.076661 0.538398 0.544538 28.962453 0.096391

Based on the statistics displayed in Table 5., the average customer of the company has a household income of around 50,000 USD, likely has kids or teens at home, and is unlikely to complain.

Table 6. Statistics for Features under Products Category
In [8]:
df.describe().loc[['mean', 'std'], col_products]
Out[8]:
mntwines mntfruits mntmeatproducts mntfishproducts mntsweetproducts mntgoldprods
mean 303.935714 26.302232 166.950000 37.525446 27.062946 44.021875
std 336.597393 39.773434 225.715373 54.628979 41.280498 52.167439

From Table 6. it can be concluded that the average consumer spends the most on wine at around 300 USD, followed by meat products at around 170 USD.

Table 7. Statistics for Features under Promotion Category
In [9]:
df.describe().loc[['mean', 'std'], col_promotion]
Out[9]:
numdealspurchases acceptedcmp1 acceptedcmp2 acceptedcmp3 acceptedcmp4 acceptedcmp5 response
mean 2.325000 0.064286 0.013393 0.072768 0.074554 0.072768 0.149107
std 1.932238 0.245316 0.114976 0.259813 0.262728 0.259813 0.356274

Based on Table 7., customers have, on average, made about two discounted purchases. Around 6-7% have accepted campaign offers, with the exception of Campaign 2, which was the least successful, capturing only around 1% of the customer base. The most recent campaign was the most successful, with an acceptance rate of around 15%.

Table 8. Statistics for Features under Place Category
In [10]:
df.describe().loc[['mean', 'std'], col_place]
Out[10]:
numwebpurchases numcatalogpurchases numstorepurchases numwebvisitsmonth
mean 4.084821 2.662054 5.790179 5.316518
std 2.778714 2.923101 3.250958 2.426645

Table 8. shows that, on average, customers are most likely to purchase their products from the store, followed by the web, and are least likely to purchase through a catalog. Additionally, customers visit the company website around 5 times a month on average.

Table 9. Statistics for Features under Others Category
In [11]:
df.describe().loc[['mean', 'std'], col_others]
Out[11]:
z_costcontact z_revenue
mean 3.0 11.0
std 0.0 0.0

The z_costcontact and z_revenue features are deemed to be unnecessary features as they are unknown variables with constant values throughout the whole dataset, as shown in Table 9.
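Constant columns like these can also be flagged programmatically. Below is a minimal sketch on a hypothetical frame mimicking the z_* columns:

```python
import pandas as pd

# Hypothetical frame mimicking the constant z_* columns
df_toy = pd.DataFrame({"z_costcontact": [3, 3, 3],
                       "z_revenue": [11, 11, 11],
                       "income": [40000, 52000, 61000]})
# Columns with exactly one unique value carry no information
constant_cols = df_toy.columns[df_toy.nunique() == 1].tolist()
print(constant_cols)  # ['z_costcontact', 'z_revenue']
```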

Using the statistics above, the average customer of the company can be inferred to have the following characteristics.

Baseline Average Customer:

They were likely born between 1963 and 1973, have an average annual income of 50,000 USD, and likely have children at home. They spend the most on wine, followed by meat products. They have a 6-7% chance of accepting campaign offers, and are most likely to purchase from the store.

Methodology Overview

Figure 1. Methodology Overview

Captured in Figure 1. is a high-level overview of the methodology pipeline of the study. Table 10. details each step of the pipeline conducted to address the problem of determining the company's customer segments.
Table 10. Methodology Details
Step Process Description
1 Data Cleaning and Preprocessing Prepare the dataset by handling missing values, mapping ordinal values, one-hot encoding nominal features, and excluding unnecessary features
2 Dimensionality Reduction Standardize the data and perform dimensionality reduction via PCA
3 Hierarchical-based Clustering Plot the dendrogram to determine threshold and perform clustering via Ward's method to determine the clusters
4 Results and Discussion Analyze the results of the clusters and sub-clusters to produce insights on each customer segment relating to their characteristics and preferences
5 Conclusion Summarize the insights to create customer profiles, and suggest a marketing strategy per segment based on the results
6 Recommendation Provide recommendations for future studies on possible further improvements that could be made given the limitations of the current project
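Steps 1-3 of the pipeline can be sketched end-to-end on synthetic data. This is only an illustrative outline under assumed toy inputs, not the study's actual run:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))                    # stand-in for a cleaned feature matrix
X_scaled = StandardScaler().fit_transform(X)    # step 2: standardize
X_pca = PCA(n_components=3).fit_transform(X_scaled)  # step 2: reduce dimensions
Z = linkage(X_pca, method='ward')               # step 3: Ward linkage
labels = fcluster(Z, t=2, criterion='maxclust')  # step 3: cut into at most 2 clusters
print(X_pca.shape, len(set(labels)))
```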

Data Cleaning and Preprocessing

Handling of Missing Values

The first step of data preparation is the identification of any missing or null values in the dataset. Missing values could indicate issues with certain data entries and may introduce noise leading to inaccurate clustering and misinterpretation of the results.

Table 11. Null Value Count per Feature
In [12]:
df.isnull().sum()
Out[12]:
id                      0
year_birth              0
education               0
marital_status          0
income                 24
kidhome                 0
teenhome                0
dt_customer             0
recency                 0
mntwines                0
mntfruits               0
mntmeatproducts         0
mntfishproducts         0
mntsweetproducts        0
mntgoldprods            0
numdealspurchases       0
numwebpurchases         0
numcatalogpurchases     0
numstorepurchases       0
numwebvisitsmonth       0
acceptedcmp3            0
acceptedcmp4            0
acceptedcmp5            0
acceptedcmp1            0
acceptedcmp2            0
complain                0
z_costcontact           0
z_revenue               0
response                0
dtype: int64

As observed in Table 11., 24 of the 2240 customers have missing or blank entries for their income. To handle these missing values, the decision was to remove these entries altogether rather than to impute their values, for the following reasons:

  • Because the purpose of the study is to conduct clustering, imputation may negatively impact the results, given that assumptions would be made about an entry's income value.
  • Removing entries with the missing feature will have minimal impact on the dataset, as they represent only around 1% of the customer base.
In [13]:
df.dropna(inplace=True)

print(f'Missing Values Remaining: {df.isnull().sum().sum()}')
Missing Values Remaining: 0
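For contrast, had imputation been chosen instead, a median fill would be one option. The following is a minimal sketch on hypothetical toy values, not the study's data:

```python
import pandas as pd

# Hypothetical income column with one missing entry
df_toy = pd.DataFrame({"income": [40000.0, None, 60000.0, 52000.0]})
# Replace the missing value with the column median
df_toy["income"] = df_toy["income"].fillna(df_toy["income"].median())
print(df_toy["income"].isnull().sum())  # 0
```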

Mapping of Ordinal Features

The next step is to handle categorical features by transforming them into data types suitable for clustering. For ordinal features, mapping can be conducted to assign an integer based on inherent ranking. The education feature is selected as an ordinal feature, with the ranking specified in Table 12.

Table 12. Ordinal Mapping for Education Feature
Original Value Mapped Value
Basic 0
Graduation 1
2n Cycle 2
Master 3
PhD 4
In [14]:
mapping_education = {
    'Basic': 0,
    'Graduation': 1,
    '2n Cycle': 2,
    'Master': 3,
    'PhD': 4
}
df['education'] = df['education'].map(mapping_education)

One-hot Encoding of Nominal Features

Categorical features without an inherent ranking can be handled by one-hot encoding, which transforms the categories into binary columns. The marital_status feature is one-hot encoded due to its lack of inherent ranking and the existence of vague entries such as YOLO and Together. The results are reflected in Table 13.

Table 13. Snapshot of Resulting Columns from One-hot Encoding of Marital Status
In [15]:
col_nominal = ['marital_status']
for col in col_nominal:
    df_dummy = pd.get_dummies(
        df[col], prefix=col, drop_first=True).astype(int)
    df = pd.concat([df, df_dummy], axis=1)
df = df.drop(col_nominal, axis=1)
df.columns = df.columns.str.lower()

display(df[[col for col in df.columns if 'marital_status' in col]].head())
marital_status_alone marital_status_divorced marital_status_married marital_status_single marital_status_together marital_status_widow marital_status_yolo
0 0 0 0 1 0 0 0
1 0 0 0 1 0 0 0
2 0 0 0 0 1 0 0
3 0 0 0 0 1 0 0
4 0 0 1 0 0 0 0

Representation of Object-type Date as Integer

As seen earlier in the data exploration stage, the dt_customer feature currently has an object data type and must be transformed into numerical data to make it compatible with clustering.

The feature, which represents the date when the customer initially enrolled with the company, can be converted to a datetime value. This can then be further transformed to an integer representing the duration of the customer's enrollment, stored in the new feature enrollment_duration. The duration is calculated relative to the most recent enrollment date, which acts as the 'as of' date, with the resulting computation displayed in Table 14.

Table 14. Snapshot of New Enrollment Duration Feature
In [16]:
df['dt_customer'] = pd.to_datetime(df['dt_customer'], format='%d-%m-%Y')
latest_enrollment_date = df['dt_customer'].max()
df['enrollment_duration'] = (latest_enrollment_date - df['dt_customer']).dt.days

display(df[['dt_customer', 'enrollment_duration']].head())

df = df.drop('dt_customer', axis=1)
dt_customer enrollment_duration
0 2012-09-04 663
1 2014-03-08 113
2 2013-08-21 312
3 2014-02-10 139
4 2014-01-19 161

Feature Exclusion

Non-informative features should be considered for removal or exclusion from clustering.

Features z_costcontact and z_revenue are considered for exclusion because they contain only constant values throughout the entire dataset, and there is no other information about what they represent from the online source (Patel, 2021).

The id feature should be removed as it serves only as an identifier for each customer. Additionally, since it comes as an integer data type, it would introduce numerical values with no informative meaning to the clustering.

In [17]:
col_drop = ['z_costcontact', 'z_revenue', 'id']
df = df.drop(col_drop, axis=1)

The result of the data preparation is found in Table 15., where all features are represented numerically, there are no missing values, and all unnecessary features are excluded. Our final preprocessed dataset contains 2216 customer entries and 32 features.

Table 15. Features of Preprocessed Customer Personality Analysis Data
In [18]:
df = df.astype(int)

df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 2216 entries, 0 to 2239
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   year_birth               2216 non-null   int64
 1   education                2216 non-null   int64
 2   income                   2216 non-null   int64
 3   kidhome                  2216 non-null   int64
 4   teenhome                 2216 non-null   int64
 5   recency                  2216 non-null   int64
 6   mntwines                 2216 non-null   int64
 7   mntfruits                2216 non-null   int64
 8   mntmeatproducts          2216 non-null   int64
 9   mntfishproducts          2216 non-null   int64
 10  mntsweetproducts         2216 non-null   int64
 11  mntgoldprods             2216 non-null   int64
 12  numdealspurchases        2216 non-null   int64
 13  numwebpurchases          2216 non-null   int64
 14  numcatalogpurchases      2216 non-null   int64
 15  numstorepurchases        2216 non-null   int64
 16  numwebvisitsmonth        2216 non-null   int64
 17  acceptedcmp3             2216 non-null   int64
 18  acceptedcmp4             2216 non-null   int64
 19  acceptedcmp5             2216 non-null   int64
 20  acceptedcmp1             2216 non-null   int64
 21  acceptedcmp2             2216 non-null   int64
 22  complain                 2216 non-null   int64
 23  response                 2216 non-null   int64
 24  marital_status_alone     2216 non-null   int64
 25  marital_status_divorced  2216 non-null   int64
 26  marital_status_married   2216 non-null   int64
 27  marital_status_single    2216 non-null   int64
 28  marital_status_together  2216 non-null   int64
 29  marital_status_widow     2216 non-null   int64
 30  marital_status_yolo      2216 non-null   int64
 31  enrollment_duration      2216 non-null   int64
dtypes: int64(32)
memory usage: 571.3 KB

Dimensionality Reduction

Dimensionality reduction will be performed due to the following reasons:

  • High-dimensional data can suffer from the curse of dimensionality as the data becomes more sparse. This can cause challenges during clustering, where results become less meaningful.
  • By reducing the number of dimensions, the computational complexity of the dataset can be reduced to improve computational efficiency.
  • Irrelevant or noise features that do not offer significant information gain can be reduced, potentially improving clustering results.

Principal Component Analysis (PCA) is the chosen method for its simplicity and efficiency. The dataset is small and non-sparse, making it manageable when performing PCA.

Data Standardization

The data is standardized in preparation for PCA. Standardizing the features ensures that each contributes evenly to the computation of the principal components and prevents features with larger magnitudes from dominating the calculation. The results of the standardization can be found in Table 16.
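Per column, StandardScaler applies z = (x − mean) / std. A minimal numeric check of this transform:

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0])
z = (x - x.mean()) / x.std()  # the transform StandardScaler applies per column
print(z.mean(), z.std())      # ~0.0 and ~1.0 after scaling
```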

Table 16. Snapshot of Scaled Features
In [19]:
standard_scaler = StandardScaler()
df_scaled = standard_scaler.fit_transform(df.values)

display(pd.DataFrame(df_scaled).head())
0 1 2 3 4 5 6 7 8 9 ... 22 23 24 25 26 27 28 29 30 31
0 -0.986443 -0.819198 0.234063 -0.823039 -0.928972 0.310532 0.978226 1.549429 1.690227 2.454568 ... -0.097812 2.377952 -0.036819 -0.341958 -0.794110 1.924807 -0.590553 -0.188452 -0.030056 1.529129
1 -1.236801 -0.819198 -0.234559 1.039938 0.909066 -0.380509 -0.872024 -0.637328 -0.717986 -0.651038 ... -0.097812 -0.420530 -0.036819 -0.341958 -0.794110 1.924807 -0.590553 -0.188452 -0.030056 -1.188411
2 -0.318822 -0.819198 0.769478 -0.823039 -0.928972 -0.795134 0.358511 0.569159 -0.178368 1.340203 ... -0.097812 -0.420530 -0.036819 -0.341958 -0.794110 -0.519533 1.693329 -0.188452 -0.030056 -0.205155
3 1.266777 -0.819198 -1.017239 1.039938 -0.928972 -0.795134 -0.872024 -0.561922 -0.655551 -0.504892 ... -0.097812 -0.420530 -0.036819 -0.341958 -0.794110 -0.519533 1.693329 -0.188452 -0.030056 -1.059945
4 1.016420 1.529240 0.240221 1.039938 -0.928972 1.554407 -0.391671 0.418348 -0.218505 0.152766 ... -0.097812 -0.420530 -0.036819 -0.341958 1.259271 -0.519533 -0.590553 -0.188452 -0.030056 -0.951244

5 rows × 32 columns

PCA

To set up the PCA, a fixed random_state was selected and the transform was fit on the scaled data. The cumulative explained variance is used to determine how many principal components to retain.

In [20]:
pca = PCA(random_state=42)
df_new = pca.fit_transform(df_scaled)

variance_explained = pca.explained_variance_ratio_
cumulative_variance_explained = variance_explained.cumsum()

fig, ax = plt.subplots()
ax.plot(range(1, len(variance_explained) + 1),
        variance_explained,
        '-',
        label='individual')
ax.set_xlim(0, len(variance_explained)+1)
ax.set_xlabel('Number of PCAs')
ax.set_ylabel('Variance explained')

ax = ax.twinx()
ax.plot(range(1, len(variance_explained) + 1),
        cumulative_variance_explained,
        'r-',
        label='cumulative')
ax.set_ylabel('Cumulative variance explained')

ax.axhline(0.81, ls='--', color='g')
ax.axvline(17, ls='--', color='g')
ax.set_title('Figure 2. Variance and Cumulative Variance Explained across PCAs');

From the results in Figure 2., a cumulative explained variance of 0.8 is reached when using 17 principal components. This is the number of components used when performing dimensionality reduction. Displayed in Table 17. are the 17 retained principal components to be used during clustering.
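The component count can also be picked programmatically from the explained-variance ratios. A sketch with hypothetical, binary-exact toy ratios (not the study's actual values):

```python
import numpy as np

ratios = np.array([0.5, 0.25, 0.125, 0.125])  # hypothetical explained-variance ratios
# Smallest k whose cumulative explained variance reaches the 0.8 target
k = int(np.argmax(np.cumsum(ratios) >= 0.8) + 1)
print(k)  # 3
```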

Table 17. Snapshot of Scaled and Reduced Features
In [21]:
pca = PCA(n_components=17, random_state=42)
df_pca_reduced = pca.fit_transform(df_scaled)

display(pd.DataFrame(df_pca_reduced).head())
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 3.830161 -0.181321 0.167833 3.219346 -1.252481 -1.094959 -0.583047 0.165437 0.750438 -0.013983 -0.179083 -0.276510 0.263946 -0.477823 0.196159 -1.440021 1.344352
1 -2.435863 -0.638125 -0.241413 -0.860753 -0.924621 -1.612613 -0.846041 0.219046 0.197928 -0.640864 0.008982 0.472702 -1.303607 0.426551 -0.993351 -0.729119 0.924650
2 1.600887 -0.185814 -1.257625 -0.243510 -1.254225 1.250047 0.419811 0.540887 -0.018901 0.621133 0.143211 0.648513 -0.886527 -0.250879 0.712234 0.063698 -1.008799
3 -2.569622 -1.534893 0.002154 -0.245513 -1.094061 1.608984 0.435226 0.497322 -0.152752 0.552256 0.382589 0.483710 -0.574400 -0.288417 -0.387440 0.845467 -0.584655
4 -0.502419 -0.151635 -0.586862 -0.015381 1.647314 0.304772 -0.570991 0.467957 -0.900724 -0.458106 0.217740 -0.945133 1.616572 -0.205024 -0.852598 1.661447 0.257108

Plotting the datapoints on the first three principal components, the dataset can be visualized in Figure 3.

In [22]:
plot_3d(df_pca_reduced, y_predict=None)
Figure 3. 3D Plot of Dataset

Hierarchical-based Clustering

Ward's Method

Hierarchical-based Ward's method is the chosen clustering method due to the following reasons:

  • The plot of the dendrogram can be used to visually determine the number of clusters generated, and allows for the opportunity to identify sub-clusters beneath larger, main clusters. This is particularly useful for the study as it allows for flexibility on customer segmentation, allowing the company to further segment large customer sections to smaller groups when needed.
  • In comparison to other linkage methods, Ward's method is less sensitive to the shape of the cluster and tends to form compact, spherical clusters.

To perform the clustering, the dendrogram is first plotted to determine the epsilon threshold, which will in turn determine the total number of clusters and their corresponding datapoints.

In [24]:
Z = linkage(df_pca_reduced, method='ward', optimal_ordering=True)
plot_dendrogram(Z);

From the dendrogram in Figure 4., two clusters can be formed when selecting the epsilon threshold between 90 and 120, where the largest branch gap is found.

There is also potential for segmentation into sub-clusters at around the epsilon=80 and epsilon=70 thresholds, yielding three and four clusters respectively.

In [25]:
y_predict_ward_two = fcluster(Z, t=120, criterion='distance')
plot_3d(df_pca_reduced, y_predict_ward_two)
Figure 5. 3D Plot for Two Clusters

The two main clusters can be plotted against the first three principal components, as shown in Figure 5. From visual inspection, the clustering shows some of the commonly desired characteristics of good clustering:

  • Relatively compact points within each cluster
  • Relative separation from points outside the cluster
  • Parsimonious, consisting of only two clusters
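These visual impressions of compactness and separation can also be quantified with the silhouette coefficient, a check not performed in the original study; a sketch on synthetic stand-in data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the PCA-reduced data: two separated 3D blobs
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 3)) for c in (0, 6)])

labels = fcluster(linkage(X, method='ward'), t=2, criterion='maxclust')

# Silhouette ranges from -1 to 1; values near 1 indicate compact,
# well-separated clusters
score = silhouette_score(X, labels)
print(round(score, 3))
```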

The same process is conducted for the three- and four-cluster cases, where the count of datapoints per cluster is shown in Table 18. and Table 19. respectively.

Table 18. Customer Count for Three Clusters
In [26]:
y_predict_ward_three = fcluster(Z, t=80, criterion='distance')
display(pd.DataFrame.from_dict(Counter(y_predict_ward_three), orient='index', columns=['count']))
count
1 541
2 811
3 864
Table 19. Customer Count for Four Clusters
In [27]:
y_predict_ward_four = fcluster(Z, t=70, criterion='distance')
display(pd.DataFrame.from_dict(Counter(y_predict_ward_four), orient='index', columns=['count']).sort_index())
count
1 30
2 511
3 811
4 864

While four clusters is feasible, the number of sub-clusters is set to three instead. This is because one of the four clusters contains only 30 customers, a segment that may be too niche to be useful for the marketing and sales strategy.
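The minimum-size check used to rule out the 30-customer cluster can be made explicit; a hypothetical helper (`small_clusters` and the 100-customer floor are illustrative, not part of the study):

```python
from collections import Counter

import numpy as np

def small_clusters(labels, min_size=100):
    """Return cluster labels whose member count falls below min_size."""
    return [c for c, n in Counter(labels).items() if n < min_size]

# Cluster sizes reproduced from Table 19
labels = np.array([1] * 30 + [2] * 511 + [3] * 811 + [4] * 864)
print(small_clusters(labels))  # [1]
```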

The three clusters are visualized in Figure 6. against the first three principal components. Similar to the insights from the two clusters, the three segments also show the desired characteristics of good clustering from visual inspection.

In [28]:
plot_3d(df_pca_reduced, y_predict_ward_three)
Figure 6. 3D Plot for Three Clusters

The resulting cluster and sub-clusters can be projected back to the features of the original dataframe in order to make inferences on the distinct characteristics of each cluster type and create customer profiles. Table 20. displays each customer, their features and what cluster and sub-clusters they are tagged under.
Table 20. Customer Features and Corresponding Cluster and Sub-clusters
In [29]:
df_cluster = df.copy()
df_cluster['cluster'] = y_predict_ward_two

df_subcluster = df_cluster.copy()
df_subcluster['subcluster'] = y_predict_ward_three

display(df_subcluster.head(10))
year_birth education income kidhome teenhome recency mntwines mntfruits mntmeatproducts mntfishproducts ... marital_status_alone marital_status_divorced marital_status_married marital_status_single marital_status_together marital_status_widow marital_status_yolo enrollment_duration cluster subcluster
0 1957 1 58138 0 0 58 635 88 546 172 ... 0 0 0 1 0 0 0 663 1 1
1 1954 1 46344 1 1 38 11 1 6 2 ... 0 0 0 1 0 0 0 113 2 2
2 1965 1 71613 0 0 26 426 49 127 111 ... 0 0 0 0 1 0 0 312 1 1
3 1984 1 26646 1 0 26 11 4 20 10 ... 0 0 0 0 1 0 0 139 2 2
4 1981 4 58293 1 0 94 173 43 118 46 ... 0 0 1 0 0 0 0 161 1 1
5 1967 3 62513 0 1 16 520 42 98 0 ... 0 0 0 0 1 0 0 293 2 3
6 1971 1 55635 0 1 34 235 65 164 50 ... 0 1 0 0 0 0 0 593 2 3
7 1985 4 33454 1 0 32 76 10 56 3 ... 0 0 1 0 0 0 0 417 2 3
8 1974 4 30351 1 0 19 14 0 24 3 ... 0 0 0 0 1 0 0 388 2 2
9 1950 4 5648 1 1 68 28 0 6 1 ... 0 0 0 0 1 0 0 108 2 3

10 rows × 34 columns

Results and Discussion

To form customer profiles for each customer segment, the analysis of the average characteristics per cluster type was grouped into stages based on the feature categories: people, marital status, products, promotion and place.

In [30]:
col_people = ['year_birth', 'education', 'income', 'kidhome', 'teenhome', 'recency', 'complain', 'enrollment_duration']
col_marital_status = ['marital_status_alone', 'marital_status_divorced', 'marital_status_married', 'marital_status_single', 'marital_status_together', 'marital_status_widow', 'marital_status_yolo']
col_products = ['mntwines', 'mntfruits', 'mntmeatproducts', 'mntfishproducts', 'mntsweetproducts', 'mntgoldprods']
col_promotion = ['numdealspurchases', 'acceptedcmp1', 'acceptedcmp2', 'acceptedcmp3', 'acceptedcmp4', 'acceptedcmp5', 'response']
col_place = ['numwebpurchases', 'numcatalogpurchases', 'numstorepurchases', 'numwebvisitsmonth']

People Category

This category shows the insights for the unique identifiers that define a customer, such as their age and income, displayed in Table 21.

Table 21. People Category Averages per Cluster
In [31]:
df_subcluster.groupby(['cluster', 'subcluster']).agg('mean').loc[:, col_people]
Out[31]:
year_birth education income kidhome teenhome recency complain enrollment_duration
cluster subcluster
1 1 1969.523105 1.898336 75463.029575 0.057301 0.231054 49.580407 0.000000 353.693161
2 2 1971.678175 1.887793 35222.504316 0.762022 0.442663 49.236745 0.000000 314.610358
3 1965.697917 2.288194 53690.924769 0.381944 0.736111 48.446759 0.024306 389.937500
  • Subcluster 1: Of average age, with lower educational attainment, belonging to the high-income bracket, and likely has no children.
  • Subcluster 2: Younger, with lower educational attainment, belonging to the low-income bracket, and likely with young children at home.
  • Subcluster 3: Older, with higher educational attainment, belonging to the mid-income bracket, and likely with older children at home.

Marketing to Subcluster 1 could focus on premium products and services due to their high income. In terms of a targeted approach, Subcluster 2, with younger children and lower income, may be more receptive to budget-friendly offerings and family deals. Subcluster 3, being older and more educated, may appreciate more detailed information and a higher level of customer service.

Marital Status Category

This category shows the insights on the marital status of each segment, illustrated in Table 22.

Table 22. Marital Status Category Averages per Cluster
In [32]:
df_subcluster.groupby(['cluster', 'subcluster']).agg('mean').loc[:, col_marital_status]
Out[32]:
marital_status_alone marital_status_divorced marital_status_married marital_status_single marital_status_together marital_status_widow marital_status_yolo
cluster subcluster
1 1 0.000000 0.016636 0.434381 0.251386 0.292052 0.001848 0.000000
2 2 0.000000 0.000000 0.440197 0.271270 0.288533 0.000000 0.000000
3 0.003472 0.258102 0.306713 0.133102 0.209491 0.086806 0.002315
  • Subcluster 1: Likely married or with a partner, some single, and a few divorced.
  • Subcluster 2: Likely married or with a partner, and some single.
  • Subcluster 3: May be married or with a partner, many divorced, and a few widowed.

Subcluster 1 primarily consists of individuals who are married or in a partnership, with a significant portion single and a smaller fraction divorced. This diversity suggests varying needs and preferences within the cluster. In Subcluster 2, the majority are either married or with a partner, with a notable number of singles. The absence of divorced or widowed individuals might indicate a more homogeneous group in terms of life experiences. Subcluster 3 displays a more varied marital status distribution: apart from those married or with partners, there is a substantial proportion of divorced individuals and a noticeable number of widowed individuals, indicating a possibly older demographic with diverse life experiences.

Married or partnered individuals in Subclusters 1 and 2 respond to promotions targeting families or couples. In contrast, the diverse marital statuses in Subcluster 3 suggest the need for a more varied marketing approach.

Products Category

This category focuses on the purchasing behavior of different customer segments, specifically looking at their spending in various product categories, shown in Table 23.

Table 23. Products Category Averages per Cluster
In [33]:
df_subcluster.groupby(['cluster', 'subcluster']).agg('mean').loc[:, col_products]
Out[33]:
mntwines mntfruits mntmeatproducts mntfishproducts mntsweetproducts mntgoldprods
cluster subcluster
1 1 573.316081 66.340111 428.203327 95.051756 67.338262 77.646950
2 2 52.335388 5.727497 26.855734 8.914920 5.908755 16.168927
3 374.392361 20.682870 134.982639 28.648148 21.613426 48.966435
  • Subcluster 1: Comparable amounts spent on wine and meat products
  • Subcluster 2: Lowest amount of products bought
  • Subcluster 3: Relatively high amount of wine bought, followed by meat products

Subcluster 1 shows a balanced spending pattern with significant expenditure on wines and meat products, indicating a preference for these categories and possibly a higher disposable income. Customers in Subcluster 2 exhibit the lowest spending across all categories, which could reflect lower purchasing power or a different set of priorities and preferences. Subcluster 3 has a relatively high expenditure on wines, followed by meat products; their spending pattern falls between Subclusters 1 and 2, suggesting a moderate level of disposable income and a preference for certain luxury or high-quality products.

Marketing efforts directed at Subcluster 2 need to focus more on value-for-money products and promotions that highlight affordability; this group could be more responsive to discounts and bundle deals. The popularity of wines in Subclusters 1 and 3 suggests a potential market for exclusive wine-related products or events. There is also potential for cross-selling and upselling based on these spending patterns: for instance, customers in Subcluster 1 who spend heavily on wines and meats might be interested in gourmet food pairings or luxury kitchenware.

In [34]:
df_subcluster['amt_spent'] = (
    df_subcluster['mntwines'] +
    df_subcluster['mntfruits'] +
    df_subcluster['mntmeatproducts'] +
    df_subcluster['mntfishproducts'] +
    df_subcluster['mntsweetproducts'] +
    df_subcluster['mntgoldprods']
)

scatter = plt.scatter(
    x=df_subcluster['amt_spent'],
    y=df_subcluster['income'],
    c=df_subcluster['subcluster'],
    alpha=0.8)

plt.title("Figure 7. Income and Spending per Subcluster")
plt.xlabel('Amount Spent on Products')
plt.ylabel('Income')
plt.xlim(left=0, right=2600)
plt.ylim(bottom=0, top=200000)
plt.grid(True, linestyle='--', alpha=0.5)
handles, labels = scatter.legend_elements()
plt.legend(handles, labels, title='Subclusters')
plt.tight_layout()
plt.show()

Plotting income against the amount spent in Figure 7, Subcluster 1's expenditure lies in the estimated range of 1,000 to 2,500 with incomes above 50,000, while Subclusters 2 and 3 are scattered across the estimated range of 0 to 2,500 in expenditure with incomes between 0 and 75,000.

Promotions Category

This category focuses on how different customer segments respond to marketing campaigns and their propensity to make purchases through deals, displayed in Table 24.

Table 24. Promotions Category Averages per Cluster
In [35]:
df_subcluster.groupby(['cluster', 'subcluster']).agg('mean').loc[:, col_promotion]
Out[35]:
numdealspurchases acceptedcmp1 acceptedcmp2 acceptedcmp3 acceptedcmp4 acceptedcmp5 response
cluster subcluster
1 1 1.510166 0.205176 0.055453 0.068392 0.103512 0.231054 0.240296
2 2 2.028360 0.002466 0.000000 0.000000 0.000000 0.000000 0.062885
3 3.109954 0.033565 0.000000 0.145833 0.125000 0.042824 0.175926
  • Subcluster 1: Least likely to purchase with a discount; frequently accepts campaigns; high success for Campaigns 1, 5, and the most recent
  • Subcluster 2: Sometimes purchases with a discount; almost never accepts campaigns; few successes with Campaign 1
  • Subcluster 3: Frequently purchases with a discount; sometimes accepts campaigns; relative success for Campaigns 3, 4, and the most recent

Subcluster 1 is characterized by a low frequency of purchasing with discounts but shows a higher likelihood of responding to marketing campaigns, particularly Campaigns 1, 5, and the most recent campaign. This suggests a segment that is less price-sensitive but more responsive to targeted marketing efforts. Customers in Subcluster 2 occasionally make purchases with discounts but have a very low rate of responding to marketing campaigns, indicating a segment that is somewhat price-conscious but generally indifferent to marketing initiatives. Subcluster 3 frequently purchases with discounts and shows some responsiveness to marketing campaigns, particularly Campaigns 3, 4, and the recent campaign, suggesting a segment that is both price-sensitive and somewhat receptive to marketing. Subcluster 1 is more receptive to exclusive, non-discounted offers, while Subcluster 3 responds better to promotions that offer clear value or discounts.

For price sensitivity, Subcluster 2’s tendency to sometimes purchase with discounts but low campaign acceptance rate could indicate a segment that is opportunistic in its purchasing behavior, seeking deals but not actively engaging with marketing efforts. This suggests a need for more compelling value propositions or alternative engagement strategies.

Place Category

This category provides insights into how each customer subcluster prefers to make purchases (web, catalog, or in-store) and their engagement with the company website, illustrated in Table 25.

Table 25. Place Category Averages per Cluster
In [36]:
df_subcluster.groupby(['cluster', 'subcluster']).agg('mean').loc[:, col_place]
Out[36]:
numwebpurchases numcatalogpurchases numstorepurchases numwebvisitsmonth
cluster subcluster
1 1 5.162662 5.659889 8.434381 3.020333
2 2 2.282367 0.600493 3.453761 6.353884
3 5.103009 2.743056 6.355324 5.787037
  • Subcluster 1: Likely to purchase from store, followed by catalog or web, least likely to visit company website
  • Subcluster 2: Likely to purchase from store or web, infrequent purchase from catalog, frequently visits company website
  • Subcluster 3: Likely to purchase from store or web, followed by catalog, sometimes visits company website

Subcluster 1 shows a strong preference for in-store purchases, followed by catalog and web purchases, and is the least likely to visit the company's website. This suggests a segment that values the physical shopping experience and possibly personal interaction. Customers in Subcluster 2 are inclined to make purchases both in-store and through the web, but infrequently purchase through catalogs; they also frequently visit the company website, indicating a higher level of online engagement. Subcluster 3 prefers to purchase from stores or through the web, followed by catalog purchases, and visits the website moderately often, suggesting a balanced approach to both online and offline shopping experiences.

For Subcluster 1, enhancing the in-store experience and catalog design could lead to increased customer satisfaction and sales. For Subclusters 2 and 3, improving the online shopping experience and website usability might be more impactful. The high frequency of web visits by Subcluster 2 offers an opportunity to gather more data on customer preferences and behaviors through their online interactions, which can be used to further personalize offers and recommendations.

Conclusion

Using the hierarchical-based Ward's method of clustering, the study identified distinct segments within the data and the key characteristics that differentiate these segments, displayed in Figure 8.

Figure 8. The Customer Segments

Cluster 1: The Affluent Traditionalists

Subcluster 1 The Premier Traditionalists: These individuals are characterized by a preference for premium goods, responsiveness to marketing campaigns, and a tendency toward in-store shopping. This subcluster represents the premium and engaged segment, emphasized by their high-end purchasing preferences and significant expenditure on premium categories like wines and meats.

Cluster 2: The Digital Economizers

Subcluster 2 The Budget-Focused Families: These individuals are marked by budget-consciousness, digital engagement, and a lower response to marketing campaigns. This subcluster embodies the value-conscious, digitally engaged segment: their frequent web visits and balanced online and offline purchasing behavior suggest comfort with digital platforms, coupled with a focus on economical choices.

Subcluster 3 The Flexible Consumers: These individuals are characterized by moderate spending and a balanced approach to shopping and campaign responsiveness. This subcluster captures the moderate and diverse shopper segment, highlighted by their versatility in adapting to different shopping modes and their responsiveness to various marketing campaigns.

Clustering enables the company to tailor its marketing, sales, and operational strategies to each unique segment. Capturing each segment's preferences and behaviors allows for the development of targeted marketing and sales efforts, ensuring the optimization of budget and resources. This targeted approach not only enhances customer experience but also bolsters long-term retention by aligning the company's offerings and communication with the specific needs and preferences of each segment. Lastly, clustering allows the company to allocate resources more efficiently by focusing efforts on the most profitable or responsive customer segments.

The main clusters provide a strategic roadmap for tailoring marketing and engagement efforts. For Main Cluster 1, the focus should be on premium offerings, personalized services, and enhancing the in-store experience. For Main Cluster 2, strategies should revolve around digital engagement, value-based promotions, and accommodating the needs of budget-conscious families.

Recommendations

The application of hierarchical clustering with Ward's method has yielded insightful customer segmentations. To further enhance the effectiveness of this approach, it is recommended that future research build on these findings, address the limitations outlined below, and integrate the advanced methodologies suggested. This continued effort will be instrumental in refining the segmentation strategy and maintaining the company's relevance in a dynamic market landscape.

Limitations

  • The current analysis is based on a specific dataset, which may not capture the full spectrum of the customer base. There is a potential limitation in the diversity of the data, possibly overlooking emerging customer segments or underrepresented demographics. An example of this is the nature of the previous marketing campaigns used by the company. Marketing campaigns are targeted to certain segments and are often changing.
  • The study analyzes customer behavior at a specific point in time. Consumer preferences and market dynamics are constantly evolving; hence, the findings might not fully encapsulate future trends or shifts in consumer behavior.
  • The data lacks qualitative insights such as customer motivations, preferences, and perceptions. These aspects are critical in understanding the deeper reasons behind purchasing decisions and customer loyalty.
  • The analysis primarily focuses on descriptive and inferential statistics. Incorporating predictive modeling could provide foresight into future customer behaviors and market trends.
  • The study does not account for external factors like economic conditions, cultural trends, or competitive actions that can significantly influence customer behavior.

Integrating Advanced Methodologies

  • Integrating qualitative research methods such as interviews, focus groups, or surveys could offer invaluable insights into customer attitudes, motivations, and satisfaction levels
  • Implementing advanced predictive analytics and machine learning models could forecast future customer behaviors and market trends, aiding in proactive decision-making.
  • Considering the impact of external factors such as economic shifts, cultural changes, and competitive landscape dynamics to understand their influence on customer behavior.
  • Detailed mapping of customer journeys for each segment can provide deeper insights into various touchpoints and opportunities for enhancing customer experience.
  • Feature engineering to capture the success rates and themes of previous marketing campaigns.
  • Using other clustering methods and preprocessing techniques.
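As an illustration of the last point, a k-means baseline could be compared against the Ward partition using the adjusted Rand index; a sketch on synthetic stand-in data (k-means was not run in the study):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in data: three separated 3D blobs
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(80, 3)) for c in (0, 4, 8)])

ward_labels = fcluster(linkage(X, method='ward'), t=3, criterion='maxclust')
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Adjusted Rand index near 1 means the two methods produce
# essentially the same partition (label names may differ)
ari = adjusted_rand_score(ward_labels, km_labels)
print(round(ari, 3))
```

A low index on the real data would flag partitions worth inspecting more closely before committing to either method.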

References

Patel, A. (2021, August 23). Customer Personality Analysis. Kaggle. https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis